AITopics | external validity

Collaborating Authors

external validity

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

RetiringAdult: NewDatasetsforFairMachineLearning

Neural Information Processing SystemsFeb-8-2026, 04:17:02 GMT

Although the fairness community has recognized the importance of data, re-searchers inthe area primarily rely on UCIAdult when itcomes totabular data.

artificial intelligence, intervention, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Puerto Rico (0.04)
Europe > France (0.04)

Genre: Research Report (0.69)

Industry: Law (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Neural Information Processing SystemsDec-23-2025, 16:42:18 GMT

We consider the evaluation and training of a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same distribution of covariate between the historical and evaluation data, there often exists a problem of a covariate shift,i.e., the distribution of the covariate of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.

evaluation data, external validity, off-policy evaluation and learning, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

Establishing Validity for Distance Functions and Internal Clustering Validity Indices in Correlation Space

Degen, Isabella, Abdallah, Zahraa S, Brown, Kate Robson, Reeve, Henry W J

arXiv.org Artificial IntelligenceDec-8-2025

Internal clustering validity indices (ICVIs) assess clustering quality without ground truth labels. Comparative studies consistently find that no single ICVI outperforms others across datasets, leaving practitioners without principled ICVI selection. We argue that inconsistent ICVI performance arises because studies evaluate them based on matching human labels rather than measuring the quality of the discovered structure in the data, using datasets without formally quantifying the structure type and quality. Structure type refers to the mathematical organisation in data that clustering aims to discover. Validity theory requires a theoretical definition of clustering quality, which depends on structure type. We demonstrate this through the first validity assessment of clustering quality measures for correlation patterns, a structure type that arises from clustering time series by correlation relationships. We formalise 23 canonical correlation patterns as the theoretical optimal clustering and use synthetic data modelling this structure with controlled perturbations to evaluate validity across content, criterion, construct, and external validity. Our findings show that Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns, whilst Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, whilst correlation-specific functions fail structural, criterion, and external validity. These results differ from previous studies where VRC and PBM performed well, demonstrating that validity depends on structure type. Our structure-type-specific validation method provides both practical guidance (quality thresholds SWC>0.9, DBI<0.15) and a methodological template for establishing validity for other structure types.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2507.16497

Country: North America > United States > California (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.67)

Industry:

Health & Medicine (0.67)
Banking & Finance (0.45)

Technology:

Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.45)

Add feedback

Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation

Thomas, Danielle R., Borchers, Conrad, Koedinger, Kenneth R.

arXiv.org Artificial IntelligenceAug-4-2025

Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth--prioritizing validity and educational impact over consensus alone.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.00143

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

PCA for Enhanced Cross-Dataset Generalizability in Breast Ultrasound Tumor Segmentation

Schmidt, Christian, Overhoff, Heinrich Martin

arXiv.org Artificial IntelligenceMay-30-2025

In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions-such as domain adaptation and GAN-based style transfer-while promising, often fall short in the medical domain where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90\% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality reduced PCA-dataset is created and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was inferenced on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was inferenced on five out-of-domain PCA datasets. Our experimental results indicate that using PCA reconstructed datasets, instead of original images, improves the model's recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 $\pm$ 0.07 vs. 0.70 $\pm$ 0.05, $p = 0.0004$) and Dice scores (0.50 $\pm$ 0.06 vs. 0.58 $\pm$ 0.06, $p = 0.03$). Our method reduced the decline in recall values due to external validation by $33\%$. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2505.23587

Country: Europe > Germany (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.89)
Health & Medicine > Therapeutic Area (0.69)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Neural Information Processing SystemsOct-9-2024, 09:25:15 GMT

evaluation data, external validity, off-policy evaluation and learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.43)

Add feedback

Leveraging Machine Learning for Official Statistics: A Statistical Manifesto

Puts, Marco, Salgado, David, Daas, Piet

arXiv.org Machine LearningSep-6-2024

It is important for official statistics production to apply ML with statistical rigor, as it presents both opportunities and challenges. Although machine learning has enjoyed rapid technological advances in recent years, its application does not possess the methodological robustness necessary to produce high quality statistical results. In order to account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. As a means of ensuring that ML models are both internally valid as well as externally valid, the TMLE model addresses issues such as representativeness and measurement errors. There are several case studies presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics.

official statistics, statistics, website, (16 more...)

arXiv.org Machine Learning

2409.04365

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Spain > Galicia > Madrid (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)

Genre: Research Report (1.00)

Industry: Government (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

AExGym: Benchmarks and Environments for Adaptive Experimentation

Wang, Jimmy, Che, Ethan, Jiang, Daniel R., Namkoong, Hongseok

arXiv.org Artificial IntelligenceAug-8-2024

Innovations across science and industry are evaluated using randomized trials (i.e., A/B tests). While simple and robust, such static designs are inefficient or infeasible for testing many hypotheses. Adaptive designs can greatly improve statistical power in theory, but they have seen limited adoption due to their fragility in practice. We present a benchmark for adaptive experimentation based on realworld datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. Our benchmark aims to spur methodological development that puts practical performance (e.g., robustness) as a central concern, rather than mathematical guarantees on contrived instances. We release an opensource library, AExGym, which is designed with modularity and extensibility in mind to allow experimentation practitioners to develop and benchmark custom environments and algorithms.

algorithm, constraint, experiment, (14 more...)

arXiv.org Artificial Intelligence

2408.04531

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Georgia > Fulton County > Atlanta (0.14)
North America > United States > Pennsylvania (0.05)
(8 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.88)

Industry:

Health & Medicine (1.00)
Information Technology (0.93)
Government (0.68)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Data Science > Data Mining > Big Data (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

Evaluating AI Evaluation: Perils and Prospects

Burden, John

arXiv.org Artificial IntelligenceJul-12-2024

As AI systems appear to exhibit ever-increasing capability and generality, assessing their true potential and safety becomes paramount. This paper contends that the prevalent evaluation methods for these systems are fundamentally inadequate, heightening the risks and potential hazards associated with AI. I argue that a reformation is required in the way we evaluate AI systems and that we should look towards cognitive sciences for inspiration in our approaches, which have a longstanding tradition of assessing general intelligence across diverse species. We will identify some of the difficulties that need to be overcome when applying cognitively-inspired approaches to general-purpose AI systems and also analyse the emerging area of "Evals". The paper concludes by identifying promising research pathways that could refine AI evaluation, advancing it towards a rigorous scientific domain that contributes to the development of safe AI systems.

ai system, evaluation, peril and prospect, (12 more...)

arXiv.org Artificial Intelligence

2407.09221

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
(15 more...)

Genre: Research Report > Experimental Study (0.67)

Industry:

Law (1.00)
Information Technology (1.00)
Health & Medicine > Therapeutic Area (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Applied AI (1.00)
(3 more...)

Add feedback

Towards Generalizing Inferences from Trials to Target Populations

Huang, Melody Y, Parikh, Harsh

arXiv.org Artificial IntelligenceMay-24-2024

Randomized Controlled Trials (RCTs) are pivotal in generating internally valid estimates with minimal assumptions, serving as a cornerstone for researchers dedicated to advancing causal inference methods. However, extending these findings beyond the experimental cohort to achieve externally valid estimates is crucial for broader scientific inquiry. This paper delves into the forefront of addressing these external validity challenges, encapsulating the essence of a multidisciplinary workshop held at the Institute for Computational and Experimental Research in Mathematics (ICERM), Brown University, in Fall 2023. The workshop congregated experts from diverse fields including social science, medicine, public health, statistics, computer science, and education, to tackle the unique obstacles each discipline faces in extrapolating experimental findings. Our study presents three key contributions: we integrate ongoing efforts, highlighting methodological synergies across fields; provide an exhaustive review of generalizability and transportability based on the workshop's discourse; and identify persistent hurdles while suggesting avenues for future research. By doing so, this paper aims to enhance the collective understanding of the generalizability and transportability of causal effects, fostering cross-disciplinary collaboration and offering valuable insights for researchers working on refining and applying causal inference methods.

arxiv preprint arxiv, target population, treatment effect, (14 more...)

arXiv.org Artificial Intelligence

2402.17042

Genre:

Research Report > Strength High (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback